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Abstract 

This paper demonstrates how the challenging problem of the Arabic broken plurals and 
diminutive can be handled under a multi-tape two-level model, an extension to two-level 



in 

morphology. 

^ 1 Introduction 

The phenomenon of the broken phiral and diminutive in Arabic poses a challenge to main-stream 
0^ two-level morphology, not only because of its nonconcatenative nature, but also because its analysis 

relies heavily on prosodic structure. The purpose of this paper is to present an implemented 
morphological model which is capable of analysing the broken plural. 

The following convention has been adopted. Morphemes are represented in braces, { }, and sur- 
face forms in soHdi, / /. In examples of grammars, variables begin with a capital letter. Cs denote 
consonants and Vs denote vowels. In two-level rules, square brackets mark optional segments. 

The structure of the paper is as follows: section 2 presents the problem of the broken plural; 
section 3 introduces a computational framework for solving the problem; section 4 demonstrates 
how the broken plural may be derived via the 'implicit derivation' of the singular under a two-level 
model; finally, section 5 gives concluding remarks. 



2 Problem Description 

The derivation of the broken plurals and diminutive in Arabic is a complicated task. Consider the 
data in (1) - from McCarthy and Prince (1990a). 

(1) Broken Plural and Diminutive Data 
Singular Plural Diminutive Gloss 



jundub janaadib junaydib 'locust' 
sultaan salaatiin sulaytaan 'sultan' 



The analysis of the plural makes use of prosodic structure, where a word (W) consists of at least 
one foot (F), and feet consist of syllables (a) as in (2). 



* Supported by a Benefactor Studentship from St John's College. This research was done under the supervision 
of Dr Stephen G. Pulman. 
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(2) Prosodic Hierarchy 



W 




F F 



O" O" 




j u n d u b 



McCarthy and Prince (1990a) show that broken plurals have the plural template in (3).^ 

(3) Plural Template 
F 




C a C a a 



The derivation of the plural from the singular takes the following path. The singular (e.g. 
/jundub/) is factored into two partitions: a kernel which consist of the first foot (i.e. /jun/) 
and a residue which is the remainder (i.e. /dub/). The kernel /jun/ is mapped onto the plural 
template: the consonants link to the Cs, and the default plural vowel [a] remains. This results in 
the representation in (4). 

(4) Deriving /janaadib/ 



F 




(T cr 

A 

J a n a a 

Then, the residue /dub/ is added. This is straightforward, except that the default vowel for the 
residue is [i] (i.e. /dub/ becomes /dib/) resulting in /janaadib/. The derivation of /salaatiin/ 
from /sultaan/ follows the same path. The kernel /sul/ maps to the plural template producing 
/salaa/, and the residue /taan/ is added with the vowel [i]; hence, /salaatiin/. The diminutive can 
be derived in a similar fashion. 

Traditional grammars provide a much simpler analysis. Wright (1988) hsts 31 broken plural 
patterns. Each root must be lexicahy marked with its own pattern. For example, the root {jndb} 
is marked with the pattern CaCaaCiC, and {sltn} is marked with CaCaaCiiC. Instantiating the 
Cs with the root consonants produces /janaadib/ and /salaatiin/, respectively. The analysis of 
McCarthy and Prince, however, provides a wider generalisation: the operation of pluralisation, 
which can be achieved by morphological rules in a computational framework, rather than by 
exphcit, manual marking of lexical entries. 



3 Computational Framework 

This section presents a computational framework under which the broken plural can be analysed. 
Section 3.1 gives some background to Arabic computational morphology, and section 3.2 defines 
our model. 

3.1 Background 

Two-level morphology (Koskenniemi 1983) defines two levels of strings in recognition and synthesis: 
lexical strings represent morphemes, and surface strings represent surface forms. Two-level rules 
map the two strings; the rules are compiled into finite state transducers, where lexical strings sit 
on one tape of the transducers and surface strings on the other. 



^The default plural vowel melody of the broken plural is {ai} except in CVCC and CVCVC singular stems. 
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Traditional two-level models are biased towards concatenative morphology since they assume 
that the lexical string is the concatenation of lexical morphemes. This requirement makes it 
extremely difficult, if not impossible, to analyse the data presented in section 2. 

A number of proposals for handling Arabic morphology exist-^; none provide a framework 
under which the phenomenon of the broken plural can be analysed in a linguistically motivated 
method. An implemented system which can handle the broken plural is the ALPNET system 
reported by (Beesley et al. 1989; Beesley 1990; Beesley 1991). In this system, root entries in the 
lexicon are associated with a set of nominal patterns, some of which indicate the broken plural 
(Bccslejf, personal communication). For example, the root {j_n_d_b} (where _ represents an infix) 
is associated with the patterns {_u__u_} 'singular' and {_a_aa j_} 'plural' (where _ represents a root 
consonant). The intersection of the root morpheme and pattern morphemes produces /jundub/ 
and /janaadib/, respectively. 

There is no doubt that the ALPNET system can produce broken plurals. However, its handling 
of the broken plural follows traditional grammars (the ALPNET project started before the new 
findings of - see section 2 - were published) . It is more desirable to have a morphological model 
which can capture the generahsations in these findings. Under this framework, the operation of 
'pluraHsation' would be handled by two-level rules, rather than marking each root entry in the 
lexicon with its broken plural pattern(s). Hence, lexicon maintenance would be easier and more 
efficient, specially that pluralisation is productive in Arabic, e.g. /sandal/ -^— /sanaadil/ 'sandal'. 
The next subsection introduces such a model. 



3,2 A Computational Model 

There are two main difficulties in handling broken plurals. The first is choosing a formalism which 
allows the user to write two-level rules for deriving broken plurals along the fines described in 
section 2. The formalism must allow the mapping between lexical and surface strings of unequal 
lengths. For example, in the derivation of /janaadib/ from /jundub/, one needs to map the kernel 
/jun/ to /janaa/. A formalism which fits this criterion appears in (5) (Ruessink 1989; Pulman and 
Hepple 1993).=^ 

(5) Two-Level Formalism 

a. LSC - Surf - RSC ^ LLC - Lex - RLC 

b. LSC - Surf - RSC o LLC - Lex - RLC 
where 

LSC = left surface context LLC = 
Surf = surface form Lex = 

RSC = right surface context RLC = 



left lexical context 

lexical form 

right lexical context 



The special symbol * indicates an empty context, which is always satisfied. The operator => states 
that Lex may surface as Surf in the given context, while the operator O adds the condition that 
when Lex appears in the given context, then the surface description must satisfy Surf. The latter 
caters for obligatory rules. A lexical string maps to a surface string iff they can be partitioned into 
pairs of lexical-surface subsequences, where each pair is licenced by a rule. Rules are associated 
with a feature structure which must unify with the lexical feature structure of the morpheme 
affected by the rule. 

In order to illustrate how the formalism can be used for deriving broken plurals, let us ignore 

for the moment the tcmplatic nature of Arabic morphology. Assume that the lexicon maintains 
the singular stem /jundub/. Two rules are required for the derivation of the plural /janaadib/ (6). 



^Previous works include: Kay (1987), Kornai (1991), Wiebe (1992), Narayanan and Hashem (1993), Pulman and 
Hepple (1993), and Bird and Ellison (1994). 

^Our implementation interprets rules directly; hence, we allow unequal representation of strings. If the rules 
were to be compiled into automata, a genuine symbol, e.g. 0, must be introduced by the rule compiler. For the 
compilation of our formalism into automata, see Grimley-Evans, Kiraz & Pulman, forthcoming. 
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(6) Two-Level Rules 
Rl: 



R2: 
R3: 



* - C1V1C2 - C3V2[V2]C4 O 

CiaC2aa - * 



C1V1C2 - C3V2C4 - * o 



* - C3iC4 - * 



C1V1C2 - C3V2V2C4 - * o 
* - C3iiC4 - * 



Rl maps the kernel /jun/ to /janaa/ and R2 maps the residue /dub/ to /dib/. R3 is similar to R2 
except that it sanctions residues with a long vowel, e.g. /taan/ in /sultaan/. The lexical contexts 
ensure that the proper rule is applied to the kernel, B:$, or the residue, B/$. (7) illustrates the 
two-level derivations of /janaadib/ and /salaatiin/. 

(7) Two-Level Analysis of /janaadib/ and /salaatiin/ 

Lexical Tape 



jun 



dub 



sul 



taan 



Rl R2 



Rl R3 



janaa 



dib 



salaa 



tnn 



Surface Tape 



The second difficulty in analysing the broken plural is a result of the nature of Arabic mor- 
phology, where the majority of Arabic stems are templatic, i.e. they are derived from a root and a 
vowel melody according to a specific template (or pattern). For example, the singular /jundub/ is 
derived from the root morpheme {jndb} and the vowel melody morpheme {u}; both are arranged 
according to the template morpheme {CVCCVC}. This derivation is illustrated in (8).^ 

(8) Derivation of /jundub/ 
u 

/jundub/ = ^ ^^fT^r c 

J n d b 



Kiraz (1994) proposed some extensions to the above formaHsm in order to handle Arabic templatic 
morphology. One of the extensions introduced is that all expressions in the lexical side of the rules 
(i.e. LLC, Lex and RLC) are n-tuple of regular expressions of the form (xi, X2, . . ., x„). The ith 
expression refers to symbols on the ith tape. When n = 1, the parentheses can be ignored; hence, 
(x) and X are equivalent; a nil slot is indicated by e. (The original idea of using multiple tapes is 
due to Kay (1987).) 

Assuming that the lexicon maintains the roots {jndb} and {sltn} (each root is associated with 
the feature [number=N] ), the vocalisms {uu} and {uaa},^ and the patterns {cvccvc} and {cvccvvc}, 
the derivation of singular stems can be achieved by the rules in (9) . 

(9) Two-Level Rules 

* - (c, X, e) - * => Features: [number=sing] 



R4: * _ 



* 



(v, e, X) - * => Features: [number=sing] 



* - (v, e, X) - * 

* _ X - * 



R4 sanctions consonants and R5 sanctions vowels. The two-level derivation of /jundub/ and 
/sultaan/ appears in (10). 



■^The template morpheme is presented here in CV terms in order to simplify the presentation and in order to 
concentrate on the broken plural issue. Templates can be specified by prosodic terms such as syllable and mora 
(McCarthy and Prince 1990b; McCarthy 1993). 

^For simplicity, the vocalism {u} and {ua} are entered as above to avoid writing spreading rules; for handling 
spreading, see Kiraz (1994). 
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(10) Two-Level Analysts oe /jundub/ and /sultaan/ 



11 


11 
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r. 


r. 


V 


c 


R4 


R5 


R4 


R4 


R5 


R4 


J 


u 


n 


d 


u 


b 



u 


a 


a 




s 


1 


t 


n 


c 


V 


c 


c 


V 


V 


c 


R4 


R5 


R4 


R4 


R5 


R5 


R4 


s 


u 


1 


t 


a 


a 


n 



Vocalism Tape (VT) 
Root Tape (RT) 
Pattern Tape (PT) 

Surface Tape (ST) 



So far, we have seen how it is possible to derive plurals from singular stems, and how to derive 
singular stems from morphemes (i.e. roots, vocalisms and patterns). The next section looks into 
how this model is capable of deriving plural forms from morphemes, rather than from singular 
stems. 



4 Analysis of the Broken Plural 



Because of its dependency on the singular (which does not constitute a lexical entry), the broken 
plural seems to require a three-level derivation. In other words, the derivation of the broken plural 
takes the following form: (root, vocahsm, pattern) singular plural. There are two methods 
for deriving the plural (the choice depends on the rest of the grammar). The first requires a 
three-level model as illustrated in (11a); it is possible to collapse the three representations into 
two via composition. The second derives the plural directly from the root, pattern and vocalism 
morphemes via the implicit derivation of the singular using just our two-level model. The 
concept is illustrated in (lib). 

(11) Two-level vs Three-level Derivation 



a. Three level System 

Pattern, Root, Vocalism 
1. 



Two-Level Rules 



Singular 
[ 



b. Two-level System 

Pattern, Root, Vocalism 
1. 



Two-Level Rules 

T 




Plural 



Plural 



Deriving the broken plural via the 'implicit derivation' of the singular is best explained by an 
example. To derive /janaadib/ (singular /jundub/), a two-level rule scans the kernel, i.e. /jun/, 
from lexical morphemes; however, instead of mapping the kernel to /jun/ on the surface, the rule 
maps it to the plural template CaCaa, i.e. /janaa/. A second rule maps the rest of the morpheme 
characters, i.e. the residue /dub/, to the surface as they are, but overwriting the singular vowel 
melody with the plural one. The derivation, then, remains within two-level theory. We say that 
the plural is derived via the 'implicit derivation' of the singular because the two-level rules find 
a singular form, but map it to the corresponding plural form on the surface. To illustrate this 
process, consider the rules in (12). 

(12) Two-Level Rules 



* - CiaC2aa 



(CVC, C1C2, Vi) - (CV[V]C, C3C4, V2[V2]) O 



(cvc, C1C2, Vi) - (cvc, C3C4, V2) - * O 
* - C3iC4 - * 



(cvc, C1C2, Vi) - (CVVC, C3C4, V2V2) - * O 

* - C3iiC4 - * 



R6-R8 perform the same thing as R1-R3, except that (1) they operate on three lexical tapes instead 
of one, and (2) they are all marked with the feature [number=pl] . The two-level derivation of 
/janaadib/ and /salaatiin/ under this scheme is illustrated in (13). 
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(13) Two-Level Analysts oe /janaadib/ and /salaatiin/ 



u 


u 


jn 


db 


cvc 


cvc 


R6 R7 


janaa 


dib 



11 


aa 


VT 


si 


tn 


RT 


cvc 


cvvc 


FT 


R6 


R8 




salaa 


tiin 


ST 



A few points should be noted: Firstly, R6-R8 are implicitly finding a singular derivation from the 
lexicon, but mapping such derivation to the corresponding plural form on the surface; note that 
the vowels which appear on VT are those of the singular /jundub/ and /sultaan/. Secondly, the 
rules are all obligatory. Thirdly, the rule feature structure [number=pl] must unify with the lexical 
feature structure [number=N] . Finally, the lexical structures in (13) and (10) are equivalent. 

Since Vs on the VT tape can unify with any vowel in the lexicon, the two-level module produces 
many analyses depending on the number of vocalisms in the lexicon. This is solved by associating 
lexical entries with feature structures for morphotactic parsing. Only the analyses with the proper 
singular vocalic melody are parsed successfully using a unification-based morphotactic grammar 
(Bear 1986; Ritchie et al. 1992). 



5 Conclusion 

I have presented in this paper a computational framework which can handle Arabic broken plurals 
in a linguistically-motivated method. The implementation has been tested on all classes of stems, 
even those which require the insertion of a default consonant [w], e.g. /jawaamiis/ 'buffaloes', and 
cases where this consonant is reahsed as [?], e.g. /jazaa?ir/ 'islands' (Kiraz 1996). Diminutive and 
the phenomenon of the masadir can be handled in a similar manner. 
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